Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue 320: support clean shutdown #953

Merged
merged 5 commits into from
May 25, 2020
Merged

Issue 320: support clean shutdown #953

merged 5 commits into from
May 25, 2020

Conversation

benoit74
Copy link
Contributor

Tries to solve #320

Feedback, code review, suggestion more than welcome.

Code is functional (i.e. it has been tested in live conditions inside a K8S cluster.

drain = start a clean shutdown, i.e. terminate ongoing commands and refuse to start new ones.

Changes :

  • new endpoint "POST /drain" on localhost:4141 to start the drain
  • new endpoint "GET /drain" on localhost:4141 to get the status of the frain
  • new command "atltantis drain" to start the drain and wait for the status to be completed (typically intended to be called from a preStop hook when running inside a K8S cluster)
  • when the user tries to start a command while a drain is already ongoing, the action is ignored and a message is displayed in his VCS: "Atlantis server is shutting down, please try again later."

@benoit74
Copy link
Contributor Author

I just broke some tests which are not initialized correctly now that I added some dependencies, I will fix this.

@codecov
Copy link

codecov bot commented Mar 24, 2020

Codecov Report

Merging #953 into master will increase coverage by 0.05%.
The diff coverage is 76.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #953      +/-   ##
==========================================
+ Coverage   71.98%   72.03%   +0.05%     
==========================================
  Files          65       68       +3     
  Lines        5411     5486      +75     
==========================================
+ Hits         3895     3952      +57     
- Misses       1210     1225      +15     
- Partials      306      309       +3     
Impacted Files Coverage Δ
cmd/drain.go 0.00% <0.00%> (ø)
server/events/command_runner.go 50.58% <71.42%> (+1.21%) ⬆️
server/server.go 64.03% <77.77%> (+0.37%) ⬆️
server/drain_controller.go 80.00% <80.00%> (ø)
server/events/drainer.go 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 12801dd...fa64779. Read the comment docs.

@benoit74
Copy link
Contributor Author

I'm done with the CI, ready for review.

@benoit74
Copy link
Contributor Author

@lkysow : Could you please have a look at this or assign someone ?
PS : the chart update is ready as well (https://github.com/benoit74/charts/tree/atlantis_gracefull_shutdown) but I wait for this to be merged + the new docker image to be released to create the chart PR to avoid to release the chart update before the binary is ready

@lkysow
Copy link
Member

lkysow commented Mar 27, 2020

This looks great! I'm wondering if we need the drain command though? Can the POST endpoint block until drain is complete? Then the lifecycle hook can just be curl -x POST localhost:4141/drain.

@lkysow
Copy link
Member

lkysow commented Mar 27, 2020

I'm also wondering if there are some other tools out there that follow this pattern that I can look at?

@benoit74
Copy link
Contributor Author

benoit74 commented Mar 27, 2020 via email

@lkysow
Copy link
Member

lkysow commented Mar 27, 2020

Kubernetes is draining nodes of running pods as well, e.g. for performing
an update of the kubelet, the OS, hardware failures, ... No idea where this
is implemented in the codebase to be honest.

I meant more other tools that are running on Kube and how they solve this problem. For example I wonder if they have a separate command or just an endpoint that blocks.

@MRinalducci
Copy link
Contributor

Hook handler implementation in Kub supports also HTTP, but only with GET method.
Seems a common way to do it too:

apiVersion: v1
kind: Pod
metadata:
name: client
spec:
containers:
 - image: mynginx
  name: client
  lifecycle:
   preStop:
    httpGet:
    port: 8080
    path: /shutdown

If we need POST method this have to be done with exec command + curl like you already mentioned.
I wouldn't implement the command if there is no added value than doing it directly trough the endpoint. Maybe the sleep part of the command aka returning when drain completed?

https://github.com/benoit74/atlantis/blob/72e54dca0df83a18bf19bc42270e69877defc7db/drain/drain.go#L29

References:

@benoit74
Copy link
Contributor Author

benoit74 commented Mar 28, 2020

Hu, ok, did not got your question right.

Find below what I've found looking in helm charts preStop (not many to be honest).

  • cassandra : nodetool decommission
  • elaticsearch : pre-stop-hook.sh (not found what is inside the script)
  • etcd : Shell script for few checks, some data extraction (member name, ...) and finally etcdctl member remove
  • kong : kong quit (which in turns performs few checks, sends a signal to nginx, and force kill if needed)
  • nginx : use of signals (but this is generalized to the whole interaction with nginx, e.g. to reload the configuration there is also a signal, ...)
  • graylog : curl -X POST http://localhost:9000/api/system/shutdown/shutdown/api/system/shutdown/shutdown

@MRinalducci
The added value of the "atlantis drain" is indeed to wait for the completion of the drain. This is what k8s expect, i.e. the preStop command must not return before the pod can be terminated. Since it could takes many minutes, I was quite unconfortable to implement this "wait" in an HTTP method (long lived HTTP connections is not that reliable usually).

@MRinalducci
Copy link
Contributor

Hi @lkysow any news about this PR? 😄

@lkysow
Copy link
Member

lkysow commented Apr 29, 2020

Hey I've just been too busy to tackle this. I'll get to it when I can.

@MRinalducci
Copy link
Contributor

Ok no problem, thanks for the update 👍

@lkysow lkysow merged commit fa64779 into runatlantis:master May 25, 2020
@lkysow
Copy link
Member

lkysow commented May 25, 2020

Hi @benoit74, I've merged this as part of #1051 however I made a couple of changes:

  1. I removed the POST /drain endpoint and instead am using SIGTERM/SIGINT. I did this because a) I was worried about the security of having the /drain endpoint exposed and b) in my testing using the signal works with kubernetes and graceperiodseconds
  2. Removed the atlantis drain command since it's not necessary anymore
  3. I renamed the drain controller to status controller and changed the endpoint to /status so it can be a generic status endpoint.

I would loved to have worked with you on your code via reviews rather than coding on top however I don't have as much time as I'd like to work on code reviews and you opened this so long ago and I know it's annoying to have to reload all the context.

Hopefully this mechanism works for you and if there are any issues then please let me know and we can look to fix.

@benoit74
Copy link
Contributor Author

Hi @lkysow
Thank you very much for this work.
I see no reason why this wouldn't work for us, we will let you know if we face any issue.
I'm totally fine with this approach as well, and your arguments makes a lot of sense to me.
No worry about the fact that we didn't worked together on this, you got it right.
Do you have any idea of a timeframe regarding when this would be released ? (so that we stop to deploy a custom build ;o))
Thanks in advance

@lkysow
Copy link
Member

lkysow commented May 26, 2020

I'll get a release out this week for sure.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants